We have two dataset for Bottle Database in CalCOFI
bottle <-read.csv('194903-202105_Bottle.csv')
cast <- read.csv('194903-202105_Cast.csv')
getwd()
## [1] "C:/Users/gvnal/OneDrive/MasaĂĽstĂĽ"
In each dipping process, the bottles go deeper step by step, and after reaching a certain point, another dipping process is started, which starts to go down from 0 meters to the depths, step by step, and this repeats.
summary(bottle)
## Cst_Cnt Btl_Cnt Sta_ID Depth_ID
## Min. : 1 Min. : 1 Length:895371 Length:895371
## 1st Qu.: 8551 1st Qu.:223844 Class :character Class :character
## Median :17665 Median :447686 Mode :character Mode :character
## Mean :17748 Mean :447686
## 3rd Qu.:27426 3rd Qu.:671529
## Max. :35644 Max. :895398
##
## Depthm T_degC Salnty O2ml_L
## Min. : 0.0 Min. : 1.44 Min. : 9.50 Min. :-200.22
## 1st Qu.: 45.0 1st Qu.: 7.71 1st Qu.:33.48 1st Qu.: 1.37
## Median : 125.0 Median :10.07 Median :33.86 Median : 3.45
## Mean : 224.4 Mean :10.82 Mean :33.83 Mean : 3.40
## 3rd Qu.: 300.0 3rd Qu.:13.90 3rd Qu.:34.19 3rd Qu.: 5.50
## Max. :5351.0 Max. :99.00 Max. :37.03 Max. : 11.13
## NA's :10969 NA's :47356 NA's :169741
## STheta O2Sat Oxy_.mol.Kg BtlNum
## Min. : 17.00 Min. :-3605.22 Min. :-8740.57 Min. : 0.0
## 1st Qu.: 24.96 1st Qu.: 21.30 1st Qu.: 61.77 1st Qu.: 5.0
## Median : 25.99 Median : 54.50 Median : 151.49 Median :10.0
## Mean : 25.81 Mean : 57.23 Mean : 149.11 Mean :10.4
## 3rd Qu.: 26.64 3rd Qu.: 97.70 3rd Qu.: 240.58 3rd Qu.:16.0
## Max. :250.78 Max. : 214.10 Max. : 485.70 Max. :25.0
## NA's :52696 NA's :204625 NA's :204673 NA's :756404
## RecInd T_prec T_qual S_prec
## Min. :3.000 Min. : 1.000 Min. :0.0 Min. : 1.00
## 1st Qu.:3.000 1st Qu.: 2.000 1st Qu.:6.0 1st Qu.: 2.00
## Median :3.000 Median : 2.000 Median :6.0 Median : 3.00
## Mean :4.692 Mean : 2.049 Mean :7.3 Mean : 2.78
## 3rd Qu.:7.000 3rd Qu.: 2.000 3rd Qu.:9.0 3rd Qu.: 3.00
## Max. :7.000 Max. :34.000 Max. :9.0 Max. :146.00
## NA's :10963 NA's :871355 NA's :47387
## S_qual P_qual O_qual SThtaq
## Min. : 4.0 Min. :3.00 Min. :6.0 Min. :6.0
## 1st Qu.: 6.0 1st Qu.:9.00 1st Qu.:9.0 1st Qu.:6.0
## Median : 9.0 Median :9.00 Median :9.0 Median :9.0
## Mean : 9.2 Mean :8.99 Mean :8.6 Mean :8.1
## 3rd Qu.: 9.0 3rd Qu.:9.00 3rd Qu.:9.0 3rd Qu.:9.0
## Max. :344.0 Max. :9.00 Max. :9.0 Max. :9.0
## NA's :809756 NA's :220697 NA's :695736 NA's :818947
## O2Satq ChlorA Chlqua Phaeop
## Min. :2.0 Min. : 0.0 Min. :8 Min. :-3.9
## 1st Qu.:9.0 1st Qu.: 0.0 1st Qu.:9 1st Qu.: 0.0
## Median :9.0 Median : 0.2 Median :9 Median : 0.1
## Mean :8.6 Mean : 0.5 Mean :9 Mean : 0.2
## 3rd Qu.:9.0 3rd Qu.: 0.4 3rd Qu.:9 3rd Qu.: 0.2
## Max. :9.0 Max. :66.1 Max. :9 Max. :65.3
## NA's :662620 NA's :649705 NA's :245686 NA's :649706
## Phaqua PO4uM PO4q SiO3uM
## Min. :8 Min. :0.0 Min. :4 Min. : 0.0
## 1st Qu.:9 1st Qu.:0.5 1st Qu.:9 1st Qu.: 3.1
## Median :9 Median :1.6 Median :9 Median : 18.0
## Mean :9 Mean :1.6 Mean :9 Mean : 26.5
## 3rd Qu.:9 3rd Qu.:2.5 3rd Qu.:9 3rd Qu.: 41.6
## Max. :9 Max. :5.2 Max. :9 Max. :196.0
## NA's :245682 NA's :454462 NA's :440666 NA's :513686
## SiO3qu NO2uM NO2q NO3uM
## Min. :4 Min. :0.0 Min. :4 Min. :-0.4
## 1st Qu.:9 1st Qu.:0.0 1st Qu.:9 1st Qu.: 0.6
## Median :9 Median :0.0 Median :9 Median :18.1
## Mean :9 Mean :0.0 Mean :9 Mean :17.3
## 3rd Qu.:9 3rd Qu.:0.0 3rd Qu.:9 3rd Qu.:30.0
## Max. :9 Max. :8.2 Max. :9 Max. :95.0
## NA's :381514 NA's :530929 NA's :346291 NA's :530373
## NO3q NH3uM NH3q C14As1
## Min. :4 Min. : 0.0 Min. :4.00 Min. : -0.2
## 1st Qu.:9 1st Qu.: 0.0 1st Qu.:9.00 1st Qu.: 0.9
## Median :9 Median : 0.0 Median :9.00 Median : 2.7
## Mean :9 Mean : 0.1 Mean :8.83 Mean : 9.9
## 3rd Qu.:9 3rd Qu.: 0.1 3rd Qu.:9.00 3rd Qu.: 8.1
## Max. :9 Max. :33.6 Max. :9.00 Max. :584.5
## NA's :357845 NA's :804882 NA's :62864 NA's :879829
## C14A1p C14A1q C14As2 C14A2p
## Min. :1.0 Min. :8 Min. : -0.2 Min. :1.0
## 1st Qu.:1.0 1st Qu.:9 1st Qu.: 0.9 1st Qu.:1.0
## Median :1.0 Median :9 Median : 2.6 Median :1.0
## Mean :1.3 Mean :9 Mean : 9.9 Mean :1.3
## 3rd Qu.:2.0 3rd Qu.:9 3rd Qu.: 8.2 3rd Qu.:2.0
## Max. :2.0 Max. :9 Max. :948.3 Max. :2.0
## NA's :882611 NA's :17377 NA's :879847 NA's :882629
## C14A2q DarkAs DarkAp DarkAq
## Min. :8 Min. :0.0 Min. :1 Min. :8
## 1st Qu.:9 1st Qu.:0.1 1st Qu.:2 1st Qu.:9
## Median :9 Median :0.1 Median :2 Median :9
## Mean :9 Mean :0.2 Mean :2 Mean :9
## 3rd Qu.:9 3rd Qu.:0.2 3rd Qu.:2 3rd Qu.:9
## Max. :9 Max. :9.8 Max. :2 Max. :9
## NA's :17359 NA's :871612 NA's :874914 NA's :25542
## MeanAs MeanAp MeanAq IncTim
## Min. : -0.2 Min. :1.0 Min. :8 Length:895371
## 1st Qu.: 1.0 1st Qu.:1.0 1st Qu.:9 Class :character
## Median : 2.5 Median :1.0 Median :9 Mode :character
## Mean : 8.5 Mean :1.3 Mean :9
## 3rd Qu.: 7.1 3rd Qu.:2.0 3rd Qu.:9
## Max. :948.3 Max. :2.0 Max. :9
## NA's :871611 NA's :874914 NA's :25543
## LightP R_Depth R_TEMP R_Sal
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.4
## 1st Qu.: 0.2 1st Qu.: 45.0 1st Qu.: 7.76 1st Qu.: 144.1
## Median : 1.4 Median : 126.0 Median :10.12 Median : 203.9
## Mean :17.3 Mean : 225.9 Mean :10.86 Mean : 221.4
## 3rd Qu.:21.0 3rd Qu.: 302.0 3rd Qu.:13.94 3rd Qu.: 300.2
## Max. :99.9 Max. :5458.0 Max. :40.04 Max. :5053.5
## NA's :875604 NA's :1 NA's :46054 NA's :52803
## R_DYNHT R_Nuts R_Oxy_.mol.Kg DIC1
## Min. :0.00 Min. : 0.0 Min. :-8740.57 Min. :1949
## 1st Qu.:0.13 1st Qu.: 0.0 1st Qu.: 61.77 1st Qu.:2028
## Median :0.34 Median : 0.0 Median : 151.49 Median :2171
## Mean :0.43 Mean : 0.1 Mean : 149.11 Mean :2153
## 3rd Qu.:0.63 3rd Qu.: 0.1 3rd Qu.: 240.58 3rd Qu.:2254
## Max. :3.88 Max. :33.6 Max. : 485.70 Max. :2368
## NA's :46688 NA's :804862 NA's :204673 NA's :893372
## DIC2 TA1 TA2 pH1
## Min. :1969 Min. :2182 Min. :2198 Min. :7.6
## 1st Qu.:2009 1st Qu.:2230 1st Qu.:2229 1st Qu.:7.9
## Median :2266 Median :2244 Median :2248 Median :7.9
## Mean :2168 Mean :2256 Mean :2279 Mean :7.9
## 3rd Qu.:2316 3rd Qu.:2278 3rd Qu.:2316 3rd Qu.:8.0
## Max. :2364 Max. :2435 Max. :2437 Max. :8.0
## NA's :895147 NA's :893287 NA's :895137 NA's :895287
## pH2 DIC.Quality.Comment
## Min. :7.9 Length:895371
## 1st Qu.:7.9 Class :character
## Median :7.9 Mode :character
## Mean :7.9
## 3rd Qu.:8.0
## Max. :8.0
## NA's :895361
qual, qua, or q (Quality Code - found in Bottle table; associated with discrete samples; examples: “O_qual”, “Chlqua”, “PO4q”):
We can not do na.omit because NA’s are valuable data !
bottle$O_qual.f<-factor(bottle$O_qual)
summary(bottle$O_qual.f)
## 6 8 9 NA's
## 28439 1455 169741 695736
As we can see there is 169741 missing data and 1455 technician thinks value is suspect rows for O_qual
qualityCode<-select(bottle,ends_with(c('q','qua','_qual')))
summary(qualityCode)
## SThtaq O2Satq PO4q NO2q
## Min. :6.0 Min. :2.0 Min. :4 Min. :4
## 1st Qu.:6.0 1st Qu.:9.0 1st Qu.:9 1st Qu.:9
## Median :9.0 Median :9.0 Median :9 Median :9
## Mean :8.1 Mean :8.6 Mean :9 Mean :9
## 3rd Qu.:9.0 3rd Qu.:9.0 3rd Qu.:9 3rd Qu.:9
## Max. :9.0 Max. :9.0 Max. :9 Max. :9
## NA's :818947 NA's :662620 NA's :440666 NA's :346291
## NO3q NH3q C14A1q C14A2q
## Min. :4 Min. :4.00 Min. :8 Min. :8
## 1st Qu.:9 1st Qu.:9.00 1st Qu.:9 1st Qu.:9
## Median :9 Median :9.00 Median :9 Median :9
## Mean :9 Mean :8.83 Mean :9 Mean :9
## 3rd Qu.:9 3rd Qu.:9.00 3rd Qu.:9 3rd Qu.:9
## Max. :9 Max. :9.00 Max. :9 Max. :9
## NA's :357845 NA's :62864 NA's :17377 NA's :17359
## DarkAq MeanAq Chlqua Phaqua
## Min. :8 Min. :8 Min. :8 Min. :8
## 1st Qu.:9 1st Qu.:9 1st Qu.:9 1st Qu.:9
## Median :9 Median :9 Median :9 Median :9
## Mean :9 Mean :9 Mean :9 Mean :9
## 3rd Qu.:9 3rd Qu.:9 3rd Qu.:9 3rd Qu.:9
## Max. :9 Max. :9 Max. :9 Max. :9
## NA's :25542 NA's :25543 NA's :245686 NA's :245682
## T_qual S_qual P_qual O_qual
## Min. :0.0 Min. : 4.0 Min. :3.00 Min. :6.0
## 1st Qu.:6.0 1st Qu.: 6.0 1st Qu.:9.00 1st Qu.:9.0
## Median :6.0 Median : 9.0 Median :9.00 Median :9.0
## Mean :7.3 Mean : 9.2 Mean :8.99 Mean :8.6
## 3rd Qu.:9.0 3rd Qu.: 9.0 3rd Qu.:9.00 3rd Qu.:9.0
## Max. :9.0 Max. :344.0 Max. :9.00 Max. :9.0
## NA's :871355 NA's :809756 NA's :220697 NA's :695736
From here S_qual has max 323, so we can not say every column that has qual, qua, or q is categorical variable. Table needed to inspected carefully to convert categorical variables into factor.
RecInd (Record Indicator - found in Bottle table):
Interpolated: It is the general name given to all the methods used to find/estimate the possible value at a point whose value is unknown.
bottle$RecInd.f<-factor(bottle$RecInd)
summary(bottle$RecInd.f)
## 3 4 5 6 7
## 474987 2 81177 4202 335003
So we can say 335003 data is interpolated in this dataset and there is 474987 observed data
Predicted values are being removed from the dataset. We plan to focus on measured values rather than estimated values during analysis.
bottle<-filter(bottle,RecInd!=7)
Also wanted to remove duplicates from both cast and bottle table. “unique.data.frame” function basically removing duplicate rows from the table if exist.
bottle<-unique.data.frame(bottle)
cast<-unique.data.frame(cast)
Wanted to count how many observations ships have had by descending arrange
orderingShipCounts <- cast %>%
group_by(Ship_Name) %>%
summarise(count = n()) %>%
arrange(desc(count))
Getting top 10 ship in figure 1, from this graph RV David Starr Jordan is the most ship used for observations
top10Data <- head(orderingShipCounts, 10)
ggplot(top10Data, aes(x = reorder(Ship_Name, -count), y = count)) +
geom_bar(stat = "identity",fill="Orange") +
labs(title = "Top 10 Ship Names",x = "Ship Name",y = "Count")+
theme(axis.text.x = element_text(angle = 50, hjust = 1)) # To make labels with 50 degree we use this and for labels next to X line adding hjust=1
Figure 1:Top 10 ship names used for measurement
Well, we can draw a histogram to find out in which year the most measurements were made. In this way, we can see in which year the most measurements were made.
ggplot(cast,aes(Year))+
geom_histogram(binwidth = 5,color='red',fill='orange')
Figure 2: Measurement counts by year
From figure 2 we obtain a decreasing number of measurements since 1950. While there were over 4000 measurements in the first years, this number decreased to around 1500 measurements in the 2000s.
We can draw a bar chart to see the data types better. In this way, we can more easily observe which data type is available and how much.
Data_Type (Data Type - found in Cast table)
cast$Data_Type.f<-factor(cast$Data_Type)
ggplot(cast,aes(Data_Type.f))+
geom_bar(fill='orange')+
labs(x="Data Type", y="Count")
Figure 3: Data type counts
We can observe from figure 3 that hydrographic data type is the most common. We see that there is at least 10 meters of cast. We can see that there is also a lot of low resolution data here.
Also wanted to check station codes in bar graph to observe which station measured most.
Sta_Code (Station Codes - found in Cast table)
cast$Sta_Code.f<-factor(cast$Sta_Code)
ggplot(cast,aes(Sta_Code.f))+
geom_bar(fill='orange')+
labs(x="Stations", y="Count")
Figure 4: Data type counts
As we can see from figure 4, the most data is from the CalCofi station. However, not only calcofi station but also IMECOCAL station is in the dataset.
We can use a contingency table to find out how comprehensively the stations can search. The cast table contains the codes of the stations and the types of data. From here, we can learn which station extracted how much data and which data types were measured the most.
contTable= prop.table(table(cast$Sta_Code,cast$Data_Type))*100
contTable
##
## 10 CT HY MX PR
## IMX 0.002805521 3.024351925 13.275726630 0.000000000 0.064526989
## MBR 0.000000000 0.252496914 0.819212210 0.187969925 0.022444170
## NRO 0.002805521 1.596341600 9.019750870 0.025249691 0.078554595
## NST 0.931433060 4.724497812 13.696554820 0.196386489 0.535854562
## OCO 0.005611043 0.937044103 3.851980698 0.137470542 0.067332510
## SCO 0.000000000 0.000000000 0.000000000 0.653686455 0.000000000
## ST 0.098193244 3.007518797 19.641454382 17.489619571 5.653125351
From figure 5, we see that the most hydrographic data is provided by standard CalCOFI station, followed by “IMECOCAL” and “Non-Standard” stations with the most hydrographic data. The standard CalCOFI station ranks first in every data type. 10 meter cast is not very common in our dataset. “Mixed CTD and Bottle Data” type is provided only and largely from Standard CalCOFI Station. “MBARI” stations do not contribute much to this dataset.
Low resolution data types can be removed from the data set because they may provide incorrect information.
model<- lm(Salnty~Depthm,bottle)
summary(model)
##
## Call:
## lm(formula = Salnty ~ Depthm, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4332 -0.2086 0.0414 0.2126 3.4043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.362e+01 7.439e-04 45194.2 <2e-16 ***
## Depthm 9.368e-04 2.161e-06 433.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4423 on 535766 degrees of freedom
## (24600 observations deleted due to missingness)
## Multiple R-squared: 0.2597, Adjusted R-squared: 0.2597
## F-statistic: 1.879e+05 on 1 and 535766 DF, p-value: < 2.2e-16
Adjusted R square is about 0.2597 which is not very accurate
Maybe temperature and depth effects salinity
model2<-lm(Salnty~Depthm+T_degC,bottle)
summary(model2)
##
## Call:
## lm(formula = Salnty ~ Depthm + T_degC, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3310 -0.1898 0.0303 0.1777 3.4057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.389e+01 2.670e-03 12693.3 <2e-16 ***
## Depthm 7.330e-04 2.902e-06 252.6 <2e-16 ***
## T_degC -2.040e-02 1.962e-04 -104.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.438 on 532520 degrees of freedom
## (27845 observations deleted due to missingness)
## Multiple R-squared: 0.2745, Adjusted R-squared: 0.2745
## F-statistic: 1.007e+05 on 2 and 532520 DF, p-value: < 2.2e-16
Adjusted R square increased into 0.2734. Formula for salinity is salinity=3.395e+01+6.488e-04xdepth-2.416e-02xtemperature in this linear modal.
ggplot(bottle,aes(Depthm,Salnty))+
geom_point(alpha=0.2)
Figure 6: Scatter plot for depth and salinity
From figure 6 as the depth increases, the salinity rate becomes more stable and remains constant. When we reach 4000 meters, the salinity remains constant, but at 0 meters, the salinity is quite variable. From this we can understand that salinity stabilizes as we deepen. At the same time, the cluster at the bottom of the chart may be showing us the outliers. It is useful to draw a boxplot for this, so we can visually see the outliers.
ggplot(bottle,aes(y=Salnty))+
geom_boxplot()
Figure 7: Box plot for salinity
In figure 7 can be seen, there are too many outliers. The IQR method will be applied to remove these outliers and the graph will be printed again, so we hope that the underlying cluster will disappear and provide us with a more accurate graph.
q1 <- quantile(bottle$Salnty, 0.25,na.rm=TRUE)
q3 <- quantile(bottle$Salnty, 0.75,na.rm=TRUE)
IQR<-IQR(bottle$Salnty,na.rm=TRUE)
lower_bound <- q1 - 1.5 * IQR
upper_bound <- q3 + 1.5 * IQR
bottle<-filter(bottle, Salnty>=lower_bound&Salnty<=upper_bound)
After clearing the outlines, 27289 rows are deleted from our data set. That means there were 27289 outliers.
ggplot(bottle,aes(Depthm,Salnty))+
geom_point(alpha=0.2)
Figure 8: Scatter plot for depth and salinity
When we plot the graph again ( figure 8 ), we realize that the cluster at the bottom has disappeared, giving us a more accurate graph. At the same time, we observe here again the stabilization of salinity as we get deeper, which we mentioned at first.
summary(lm(Salnty~Depthm,bottle))
##
## Call:
## lm(formula = Salnty ~ Depthm, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9723 -0.2103 0.0384 0.2087 1.5879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.362e+01 6.054e-04 55540.5 <2e-16 ***
## Depthm 9.400e-04 1.755e-06 535.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3588 on 533077 degrees of freedom
## Multiple R-squared: 0.35, Adjusted R-squared: 0.35
## F-statistic: 2.871e+05 on 1 and 533077 DF, p-value: < 2.2e-16
Also adjusted R squared has increased, bottle will be used in the rest of the analysis.
We wanted to check salinity vs depth graph for each ship. To do this analysis we need to merge two table, Sta_ID is the common key for these two table so we can merge bottle and cast table together by Sta_ID
cast_aggregated <- cast %>%
group_by(Sta_ID) %>%
summarize(Ship_Name = first(Ship_Name))
First grouping Ship_Names by Sta_ID, so when R checking Sta_ID will be represent for ship name
merged_table <- left_join(bottle, cast_aggregated, by = 'Sta_ID')
Then merging this two table by Sta_ID, only Ship_Name is added into end of the bottle table
ggplot(bottle,aes(Depthm,Salnty))+
geom_point()+
facet_wrap(~merged_table$Ship_Name)
Figure 9: Scatter plot for depth and salinity seperated by ship names
Using facet wrap to check salinity vs depth graph seperately by ship names From figure 9 we can say Ekvator, Ellen B. Scripps, Westwind and Yellowfin these ships did not dip the bottles any deeper during the measurement. Deepest measurement has done by RV Alexander Agassiz and Horizon
ggplot(bottle,aes(Depthm,T_degC))+
geom_point()
Figure 10: Scatter plot for depth and temperature
When we examine the graph of temperature and depth in figure 10, we observe that as we go deeper, the temperature approaches zero and decreases exponentially.
In the graph, we see an additional bulge in this exponential decrease. We plot an additional graph to observe which ships are responsible for this protrusion.
ggplot(bottle,aes(Depthm,T_degC))+
geom_point()+
facet_wrap(~merged_table$Ship_Name)
Figure 11: Scatter plot for depth and temperature by ship names
From figure 11, we observe that the ships RV Alexander Agassiz, Black Douglas, Spencer F. Baird and Stranger measured these salinity rates incorrectly.
We want to see how salinity relates to other things they measure. This is really important for studying the ocean. By saying that what we are trying to solve for is salinity as the dependent variable, we can see how different things in the environment change the salinity. We will see how and to what extent molecules affect salinity with the regression model.
To see whether molecules affect the temperature, we take the temperature as a dependent variable and consider the molecules as independent. As a result of this model, we aim to see which molecules affect the temperature and to what extent.
We created a model to learn what determines wave heights. In addition to the wind effect, how often waves are formed per second, atmospheric pressure, their direction, etc. we took these as an independent variables first. As a result of this model, we will learn which variable affects how much wave height.
correlationMatrix <- cor(bottle[,c("T_degC","Salnty","Depthm","Phaeop","PO4uM","SiO3uM","NO2uM","NO3uM","NH3uM","DarkAs")], use = "complete.obs")
corrplot(correlationMatrix, method = "circle")
Figure 12: Correlation Matrix
cor.test(bottle$SiO3uM,bottle$PO4uM)
##
## Pearson's product-moment correlation
##
## data: bottle$SiO3uM and bottle$PO4uM
## t = 1177.4, df = 233054, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9246615 0.9258301
## sample estimates:
## cor
## 0.925248
cor.test(bottle$T_degC,bottle$PO4uM)
##
## Pearson's product-moment correlation
##
## data: bottle$T_degC and bottle$PO4uM
## t = -1101.1, df = 254311, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9098530 -0.9085052
## sample estimates:
## cor
## -0.9091815
In figure 12, we can see the correlation between the variables. The darker the color, the more correlated the variables are. The lighter the color, the less correlated the variables are. The diagonal line shows the correlation of the variables with themselves. The correlation between the variables is not very high. The highest correlation is between SiO3uM and PO4uM. The correlation between these two variables is 0.954. Most negative correlation is PO4uM and T_degC got -0.904.
SalDepth<-lm(Salnty~Depthm+ChlorA+Phaeop+PO4uM+SiO3uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)
summary(SalDepth)
##
## Call:
## lm(formula = Salnty ~ Depthm + ChlorA + Phaeop + PO4uM + SiO3uM +
## NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00653 -0.09487 0.00774 0.10265 0.52308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.367e+01 1.313e-02 2563.164 < 2e-16 ***
## Depthm -1.232e-03 9.736e-05 -12.650 < 2e-16 ***
## ChlorA 6.183e-03 2.221e-03 2.784 0.0054 **
## Phaeop 1.215e-01 1.309e-02 9.279 < 2e-16 ***
## PO4uM -1.257e+00 4.457e-02 -28.205 < 2e-16 ***
## SiO3uM 1.705e-02 2.153e-03 7.919 3.08e-15 ***
## NO2uM 4.262e-02 2.821e-02 1.511 0.1309
## NO3uM 8.426e-02 3.005e-03 28.040 < 2e-16 ***
## NH3uM 8.047e-02 8.386e-03 9.596 < 2e-16 ***
## DarkAs 4.039e-02 7.190e-03 5.618 2.07e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1563 on 3992 degrees of freedom
## (529077 observations deleted due to missingness)
## Multiple R-squared: 0.3739, Adjusted R-squared: 0.3725
## F-statistic: 264.9 on 9 and 3992 DF, p-value: < 2.2e-16
In first model, we used all variables to predict salinity. But our adjusted R-squared is pretty low so we decided to remove some variables and try again. When we check the significance codes of the variables we can see that NO2uM is not significant. So we decided to remove this variable and try again.
SalDepth2<-lm(Salnty~Depthm+ChlorA+Phaeop+PO4uM+SiO3uM+NO3uM+NH3uM+DarkAs,bottle)
summary(SalDepth2)
##
## Call:
## lm(formula = Salnty ~ Depthm + ChlorA + Phaeop + PO4uM + SiO3uM +
## NO3uM + NH3uM + DarkAs, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01694 -0.09501 0.00883 0.10298 0.52552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.366e+01 1.281e-02 2626.826 < 2e-16 ***
## Depthm -1.234e-03 9.736e-05 -12.675 < 2e-16 ***
## ChlorA 6.125e-03 2.221e-03 2.758 0.00585 **
## Phaeop 1.247e-01 1.292e-02 9.656 < 2e-16 ***
## PO4uM -1.237e+00 4.259e-02 -29.051 < 2e-16 ***
## SiO3uM 1.669e-02 2.140e-03 7.799 7.93e-15 ***
## NO3uM 8.351e-02 2.965e-03 28.167 < 2e-16 ***
## NH3uM 8.397e-02 8.059e-03 10.420 < 2e-16 ***
## DarkAs 4.042e-02 7.191e-03 5.620 2.04e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1564 on 3993 degrees of freedom
## (529077 observations deleted due to missingness)
## Multiple R-squared: 0.3735, Adjusted R-squared: 0.3723
## F-statistic: 297.6 on 8 and 3993 DF, p-value: < 2.2e-16
In second iteration, we removed NO2uM variable and our adjusted R-squared decreased little bit. But we can see that all variables are significant. Wanted R-squared value higher, so we decided to remove ChlorA variable and try again. The reason why ChlorA removed is that it has the highest p-value among the significant variables.
SalDepth3<-lm(Salnty~Depthm+Phaeop+PO4uM+SiO3uM+NO3uM+NH3uM+DarkAs,bottle)
summary(SalDepth3)
##
## Call:
## lm(formula = Salnty ~ Depthm + Phaeop + PO4uM + SiO3uM + NO3uM +
## NH3uM + DarkAs, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00655 -0.09422 0.00739 0.10312 0.52822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.366e+01 1.280e-02 2630.348 < 2e-16 ***
## Depthm -1.269e-03 9.664e-05 -13.126 < 2e-16 ***
## Phaeop 1.496e-01 9.263e-03 16.147 < 2e-16 ***
## PO4uM -1.251e+00 4.234e-02 -29.537 < 2e-16 ***
## SiO3uM 1.743e-02 2.125e-03 8.205 3.09e-16 ***
## NO3uM 8.376e-02 2.966e-03 28.240 < 2e-16 ***
## NH3uM 8.264e-02 8.051e-03 10.264 < 2e-16 ***
## DarkAs 4.344e-02 7.113e-03 6.106 1.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1565 on 3994 degrees of freedom
## (529077 observations deleted due to missingness)
## Multiple R-squared: 0.3724, Adjusted R-squared: 0.3713
## F-statistic: 338.5 on 7 and 3994 DF, p-value: < 2.2e-16
In third iteration, our adjusted R-squared decreased again. Need to remove another variable. We decided to remove Phaeop variable because it has the highest p-value among the significant variables. Removed DarkAs variable because it has the highest p-value among the significant variables.
SalDepth4<-lm(Salnty~Depthm+Phaeop+PO4uM+SiO3uM+NO3uM+NH3uM,bottle)
summary(SalDepth4)
##
## Call:
## lm(formula = Salnty ~ Depthm + Phaeop + PO4uM + SiO3uM + NO3uM +
## NH3uM, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92191 -0.08798 -0.00016 0.09308 0.81508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.350e+01 3.341e-03 10026.60 <2e-16 ***
## Depthm -8.037e-04 2.249e-05 -35.74 <2e-16 ***
## Phaeop 1.130e-01 3.018e-03 37.45 <2e-16 ***
## PO4uM -8.229e-01 1.040e-02 -79.15 <2e-16 ***
## SiO3uM 3.851e-02 3.125e-04 123.25 <2e-16 ***
## NO3uM 3.927e-02 6.777e-04 57.94 <2e-16 ***
## NH3uM 4.553e-02 2.002e-03 22.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1511 on 44165 degrees of freedom
## (488907 observations deleted due to missingness)
## Multiple R-squared: 0.7253, Adjusted R-squared: 0.7252
## F-statistic: 1.943e+04 on 6 and 44165 DF, p-value: < 2.2e-16
In fourth iteration, our adjusted R-squared increased to 0.7252 which means our model explains 72.52% of the variance in our dependent variable. We can see that all variables are significant. After that plotting residuals vs fitted values to check if there is pattern.
ggplot(SalDepth4, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals")
Figure 13: Residuals vs Fitted values
In figure 13, we can see that there is no pattern in residuals vs fitted values. So we can say that our model is good.
As a result, increase of depth, PO4uM, SiO3uM, NO3uM and NH3uM causes increase of salinity. Increase of Phaeop causes decrease of salinity. Highest increase is caused by PO4uM. Highest decrease is caused by Phaeop.
TempModel<-lm(T_degC~Depthm+Salnty+O2ml_L+ChlorA+Phaeop+PO4uM+SiO3uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)
summary(TempModel)
##
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + ChlorA + Phaeop +
## PO4uM + SiO3uM + NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5911 -0.6921 -0.0652 0.6074 6.5136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.574e+01 3.709e+00 -6.942 4.50e-12 ***
## Depthm -1.845e-02 6.519e-04 -28.310 < 2e-16 ***
## Salnty 1.703e+00 1.075e-01 15.849 < 2e-16 ***
## O2ml_L -2.200e+00 5.402e-02 -40.718 < 2e-16 ***
## ChlorA 1.174e-02 1.514e-02 0.776 0.438
## Phaeop -8.199e-01 8.740e-02 -9.381 < 2e-16 ***
## PO4uM -5.401e+00 3.202e-01 -16.869 < 2e-16 ***
## SiO3uM -2.328e-02 1.428e-02 -1.631 0.103
## NO2uM -1.371e+00 1.851e-01 -7.407 1.57e-13 ***
## NO3uM -1.244e-01 2.206e-02 -5.638 1.83e-08 ***
## NH3uM 2.317e-01 5.562e-02 4.166 3.17e-05 ***
## DarkAs 2.301e-01 4.742e-02 4.852 1.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 3987 degrees of freedom
## (529080 observations deleted due to missingness)
## Multiple R-squared: 0.8134, Adjusted R-squared: 0.8128
## F-statistic: 1580 on 11 and 3987 DF, p-value: < 2.2e-16
In first model, using all possible related variables to predict temperature. But our adjusted R-squared is pretty low so we decided to remove some variables and try again. When we check the significance codes of the variables we can see that SiO3uM is not significant. So we decided to remove this variable and try again.
TempModel2<-lm(T_degC~Depthm+Salnty+O2ml_L+ChlorA+Phaeop+PO4uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)
summary(TempModel2)
##
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + ChlorA + Phaeop +
## PO4uM + NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5766 -0.6977 -0.0657 0.6046 6.4734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.521e+01 3.695e+00 -6.822 1.03e-11 ***
## Depthm -1.838e-02 6.503e-04 -28.260 < 2e-16 ***
## Salnty 1.686e+00 1.070e-01 15.762 < 2e-16 ***
## O2ml_L -2.192e+00 5.384e-02 -40.717 < 2e-16 ***
## ChlorA 8.415e-03 1.500e-02 0.561 0.575
## Phaeop -8.053e-01 8.696e-02 -9.261 < 2e-16 ***
## PO4uM -5.569e+00 3.033e-01 -18.360 < 2e-16 ***
## NO2uM -1.338e+00 1.840e-01 -7.271 4.28e-13 ***
## NO3uM -1.320e-01 2.157e-02 -6.119 1.03e-09 ***
## NH3uM 2.383e-01 5.548e-02 4.295 1.79e-05 ***
## DarkAs 2.239e-01 4.728e-02 4.736 2.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 3988 degrees of freedom
## (529080 observations deleted due to missingness)
## Multiple R-squared: 0.8132, Adjusted R-squared: 0.8128
## F-statistic: 1737 on 10 and 3988 DF, p-value: < 2.2e-16
In second iteration, when we check the adjusted R-squared value, we can see our adjusted R-squared stay same. But we can see that ChlorA variable is not significant. So we decided to remove this variable and check model again.
TempModel3<-lm(T_degC~Depthm+Salnty+O2ml_L+Phaeop+PO4uM+NO2uM+NO3uM+NH3uM+DarkAs,bottle)
summary(TempModel3)
##
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + Phaeop + PO4uM +
## NO2uM + NO3uM + NH3uM + DarkAs, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5567 -0.6982 -0.0627 0.6053 6.4738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.550e+01 3.656e+00 -6.975 3.57e-12 ***
## Depthm -1.842e-02 6.463e-04 -28.497 < 2e-16 ***
## Salnty 1.693e+00 1.061e-01 15.961 < 2e-16 ***
## O2ml_L -2.184e+00 5.195e-02 -42.050 < 2e-16 ***
## Phaeop -7.757e-01 6.902e-02 -11.238 < 2e-16 ***
## PO4uM -5.575e+00 3.030e-01 -18.398 < 2e-16 ***
## NO2uM -1.342e+00 1.838e-01 -7.302 3.41e-13 ***
## NO3uM -1.308e-01 2.147e-02 -6.093 1.21e-09 ***
## NH3uM 2.359e-01 5.532e-02 4.265 2.05e-05 ***
## DarkAs 2.275e-01 4.683e-02 4.858 1.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 3989 degrees of freedom
## (529080 observations deleted due to missingness)
## Multiple R-squared: 0.8132, Adjusted R-squared: 0.8128
## F-statistic: 1930 on 9 and 3989 DF, p-value: < 2.2e-16
In third iteration R-squared value still same. But wanted to try remove DarkAs variable because in salinity model when we removed DarkAs our adjusted R-squared increased a lot. Also milligrams carbon per cubic meter of seawater might not be related to temperature. So we decided to remove this variable and check model again.
TempModel4<-lm(T_degC~Depthm+Salnty+O2ml_L+Phaeop+PO4uM+NO2uM+NO3uM+NH3uM,bottle)
summary(TempModel4)
##
## Call:
## lm(formula = T_degC ~ Depthm + Salnty + O2ml_L + Phaeop + PO4uM +
## NO2uM + NO3uM + NH3uM, data = bottle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3924 -0.5663 -0.0534 0.4511 13.4670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.037e+01 1.025e+00 -49.15 <2e-16 ***
## Depthm -1.409e-02 1.372e-04 -102.73 <2e-16 ***
## Salnty 2.293e+00 2.933e-02 78.18 <2e-16 ***
## O2ml_L -1.487e+00 1.394e-02 -106.64 <2e-16 ***
## Phaeop -8.882e-01 2.073e-02 -42.85 <2e-16 ***
## PO4uM -3.279e+00 6.902e-02 -47.51 <2e-16 ***
## NO2uM -1.782e+00 5.090e-02 -35.01 <2e-16 ***
## NO3uM -2.140e-01 4.436e-03 -48.24 <2e-16 ***
## NH3uM 1.511e-01 1.291e-02 11.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9353 on 44142 degrees of freedom
## (488928 observations deleted due to missingness)
## Multiple R-squared: 0.9022, Adjusted R-squared: 0.9022
## F-statistic: 5.09e+04 on 8 and 44142 DF, p-value: < 2.2e-16
After removing DarkAs variable, our adjusted R-squared increased to 0.9022. So our model pretty good. We can see that all variables are significant. After that plotting residuals vs fitted values to check if there is pattern.
ggplot(TempModel4, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals")
Figure 14: Residuals vs Fitted values
In figure 14, we can see that there is no pattern in residuals vs fitted values. So we can say that our model is good.
As a result, increase of depth, salinity, O2ml_L, Phaeop, PO4uM, NO2uM and NO3uM causes decrease of temperature. Increase of NH3uM causes increase of temperature. Highest decrease is caused by PO4uM. Highest increase is caused by salinity.
Wea means 1 Digit Code from The World Meteorlogical Organization, Code source WMO 4501.
# converting date as date format. %d/%m/%Y stays for day/month/year
cast$date<-as.Date(cast$Date, format = "%d/%m/%Y")
cast$Day <- as.numeric(format(cast$date, "%d"))
cast$Month <- as.numeric(format(cast$date, "%m"))
# Transforming Wea to factor
cast$Wea<-factor(cast$Wea)
We can start writing our model. For this model we will use cast table.
Wavemodel1<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Wet_T+Barometer+Ac_Line+Ac_Sta+Secchi+Cloud_Amt+Wea+Day+Month+Year,cast)
summary(Wavemodel1)
##
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Wet_T +
## Barometer + Ac_Line + Ac_Sta + Secchi + Cloud_Amt + Wea +
## Day + Month + Year, data = cast)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7112 -1.0929 -0.1166 0.8754 17.8671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.575079 12.490137 7.412 1.82e-13 ***
## Wave_Prd 0.347410 0.019901 17.457 < 2e-16 ***
## Wind_Spd 0.184925 0.006615 27.956 < 2e-16 ***
## Dry_T -0.231620 0.043794 -5.289 1.36e-07 ***
## Wet_T 0.023468 0.039811 0.589 0.555606
## Barometer -0.041558 0.011135 -3.732 0.000195 ***
## Ac_Line 0.003222 0.003779 0.853 0.394003
## Ac_Sta 0.013859 0.001599 8.668 < 2e-16 ***
## Secchi 0.014854 0.005984 2.482 0.013132 *
## Cloud_Amt -0.010642 0.022393 -0.475 0.634682
## Wea1 0.154305 0.160445 0.962 0.336299
## Wea2 -0.209566 0.220517 -0.950 0.342056
## Wea3 -0.831516 1.238643 -0.671 0.502099
## Wea4 0.158467 0.325142 0.487 0.626045
## Wea5 0.006820 0.442387 0.015 0.987701
## Wea6 -2.422808 0.489680 -4.948 8.13e-07 ***
## Wea8 -1.283431 1.748368 -0.734 0.462989
## Day 0.026084 0.014690 1.776 0.075947 .
## Month 0.017376 0.011395 1.525 0.127451
## Year -0.024468 0.002949 -8.297 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.729 on 2010 degrees of freedom
## (33614 observations deleted due to missingness)
## Multiple R-squared: 0.4918, Adjusted R-squared: 0.487
## F-statistic: 102.4 on 19 and 2010 DF, p-value: < 2.2e-16
In first model, we observing that there is lots of non-significant variables. So we decided to remove one of them and try again. We decided to remove Cloud_Amt variable because it has the highest p-value among the significant variables.
Wavemodel2<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Wet_T+Barometer+Ac_Line+Ac_Sta+Secchi+Wea+Day+Month+Year,cast)
summary(Wavemodel2)
##
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Wet_T +
## Barometer + Ac_Line + Ac_Sta + Secchi + Wea + Day + Month +
## Year, data = cast)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7493 -1.0906 -0.1107 0.8783 17.7924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.760059 12.338713 7.923 3.78e-15 ***
## Wave_Prd 0.351043 0.019703 17.817 < 2e-16 ***
## Wind_Spd 0.184235 0.006555 28.105 < 2e-16 ***
## Dry_T -0.233693 0.042292 -5.526 3.70e-08 ***
## Wet_T 0.022726 0.038765 0.586 0.5578
## Barometer -0.043828 0.011054 -3.965 7.60e-05 ***
## Ac_Line 0.001906 0.003717 0.513 0.6082
## Ac_Sta 0.013860 0.001566 8.850 < 2e-16 ***
## Secchi 0.014478 0.005918 2.446 0.0145 *
## Wea1 0.106688 0.128473 0.830 0.4064
## Wea2 -0.299777 0.140573 -2.133 0.0331 *
## Wea3 -0.894834 1.228497 -0.728 0.4665
## Wea4 0.154849 0.269763 0.574 0.5660
## Wea5 -0.083540 0.409371 -0.204 0.8383
## Wea6 -2.533865 0.458737 -5.524 3.75e-08 ***
## Wea8 -1.343917 1.740985 -0.772 0.4402
## Day 0.028017 0.014546 1.926 0.0542 .
## Month 0.018982 0.011307 1.679 0.0934 .
## Year -0.025837 0.002905 -8.895 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.728 on 2037 degrees of freedom
## (33588 observations deleted due to missingness)
## Multiple R-squared: 0.4939, Adjusted R-squared: 0.4894
## F-statistic: 110.4 on 18 and 2037 DF, p-value: < 2.2e-16
Our adjusted R-squared value increased little bit. But we can see that there is still non-significant variables. Ac_Line has the highest p-value among the significant variables. So we decided to remove this variable and try again.
Wavemodel3<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Wet_T+Barometer+Ac_Sta+Secchi+Wea+Day+Month+Year,cast)
summary(Wavemodel3)
##
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Wet_T +
## Barometer + Ac_Sta + Secchi + Wea + Day + Month + Year, data = cast)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7447 -1.0893 -0.1052 0.8826 17.8190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.818072 12.162783 8.125 7.69e-16 ***
## Wave_Prd 0.351447 0.019684 17.855 < 2e-16 ***
## Wind_Spd 0.184482 0.006536 28.224 < 2e-16 ***
## Dry_T -0.226462 0.039864 -5.681 1.53e-08 ***
## Wet_T 0.019877 0.038358 0.518 0.6044
## Barometer -0.043862 0.011052 -3.969 7.48e-05 ***
## Ac_Sta 0.013644 0.001508 9.047 < 2e-16 ***
## Secchi 0.014956 0.005843 2.560 0.0106 *
## Wea1 0.108375 0.128407 0.844 0.3988
## Wea2 -0.294861 0.140220 -2.103 0.0356 *
## Wea3 -0.885305 1.228134 -0.721 0.4711
## Wea4 0.150568 0.269585 0.559 0.5766
## Wea5 -0.076688 0.409079 -0.187 0.8513
## Wea6 -2.523058 0.458170 -5.507 4.12e-08 ***
## Wea8 -1.340377 1.740657 -0.770 0.4414
## Day 0.026329 0.014166 1.859 0.0632 .
## Month 0.018813 0.011301 1.665 0.0961 .
## Year -0.026299 0.002761 -9.526 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.728 on 2038 degrees of freedom
## (33588 observations deleted due to missingness)
## Multiple R-squared: 0.4938, Adjusted R-squared: 0.4896
## F-statistic: 117 on 17 and 2038 DF, p-value: < 2.2e-16
In third iteration, adjusted R-squared stays same but some variables still needed to removed. Removing Wet_T and checking new model.
Wavemodel4<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Barometer+Ac_Sta+Secchi+Wea+Day+Month+Year,cast)
summary(Wavemodel4)
##
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Barometer +
## Ac_Sta + Secchi + Wea + Day + Month + Year, data = cast)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7623 -1.0993 -0.1099 0.8817 17.8223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.052997 12.152151 8.151 6.23e-16 ***
## Wave_Prd 0.351329 0.019679 17.853 < 2e-16 ***
## Wind_Spd 0.184435 0.006535 28.225 < 2e-16 ***
## Dry_T -0.208691 0.020322 -10.269 < 2e-16 ***
## Barometer -0.044383 0.011004 -4.033 5.70e-05 ***
## Ac_Sta 0.013636 0.001508 9.044 < 2e-16 ***
## Secchi 0.014944 0.005842 2.558 0.0106 *
## Wea1 0.107200 0.128364 0.835 0.4037
## Wea2 -0.289949 0.139875 -2.073 0.0383 *
## Wea3 -0.876692 1.227801 -0.714 0.4753
## Wea4 0.173740 0.265803 0.654 0.5134
## Wea5 -0.052051 0.406234 -0.128 0.8981
## Wea6 -2.507660 0.457123 -5.486 4.63e-08 ***
## Wea8 -1.353820 1.740151 -0.778 0.4367
## Day 0.027150 0.014075 1.929 0.0539 .
## Month 0.018715 0.011297 1.657 0.0977 .
## Year -0.026156 0.002746 -9.524 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.727 on 2039 degrees of freedom
## (33588 observations deleted due to missingness)
## Multiple R-squared: 0.4938, Adjusted R-squared: 0.4898
## F-statistic: 124.3 on 16 and 2039 DF, p-value: < 2.2e-16
Our adjusted R-squared value increasing little by little. Continueing to remove non-significant variables.
Wavemodel5<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Barometer+Ac_Sta+Secchi+Wea+Day+Year,cast)
summary(Wavemodel5)
##
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Barometer +
## Ac_Sta + Secchi + Wea + Day + Year, data = cast)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7901 -1.0941 -0.1138 0.8904 17.7706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.976438 12.139951 8.071 1.18e-15 ***
## Wave_Prd 0.350638 0.019683 17.814 < 2e-16 ***
## Wind_Spd 0.184407 0.006537 28.208 < 2e-16 ***
## Dry_T -0.207761 0.020323 -10.223 < 2e-16 ***
## Barometer -0.043896 0.011005 -3.989 6.88e-05 ***
## Ac_Sta 0.013573 0.001508 9.001 < 2e-16 ***
## Secchi 0.014875 0.005845 2.545 0.0110 *
## Wea1 0.106712 0.128419 0.831 0.4061
## Wea2 -0.288913 0.139933 -2.065 0.0391 *
## Wea3 -0.883000 1.228320 -0.719 0.4723
## Wea4 0.186126 0.265812 0.700 0.4839
## Wea5 -0.037517 0.406312 -0.092 0.9264
## Wea6 -2.481160 0.457039 -5.429 6.35e-08 ***
## Wea8 -1.387281 1.740778 -0.797 0.4256
## Day 0.027306 0.014080 1.939 0.0526 .
## Year -0.025803 0.002739 -9.420 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.728 on 2040 degrees of freedom
## (33588 observations deleted due to missingness)
## Multiple R-squared: 0.4931, Adjusted R-squared: 0.4894
## F-statistic: 132.3 on 15 and 2040 DF, p-value: < 2.2e-16
Removing Day variable and checking new model.
Wavemodel6<- lm(Wave_Ht~Wave_Prd+Wind_Spd+Dry_T+Barometer+Ac_Sta+Secchi+Wea+Year,cast)
summary(Wavemodel6)
##
## Call:
## lm(formula = Wave_Ht ~ Wave_Prd + Wind_Spd + Dry_T + Barometer +
## Ac_Sta + Secchi + Wea + Year, data = cast)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.865 -1.125 -0.167 0.907 35.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.069e+02 7.797e+00 13.704 < 2e-16 ***
## Wave_Prd 3.505e-01 1.352e-02 25.917 < 2e-16 ***
## Wind_Spd 1.806e-01 4.308e-03 41.912 < 2e-16 ***
## Dry_T -1.828e-01 1.162e-02 -15.729 < 2e-16 ***
## Barometer -5.033e-02 7.053e-03 -7.136 1.10e-12 ***
## Ac_Sta 1.208e-02 9.823e-04 12.293 < 2e-16 ***
## Secchi 1.942e-02 3.664e-03 5.301 1.20e-07 ***
## Wea1 1.290e-02 7.883e-02 0.164 0.87001
## Wea2 -2.409e-01 8.622e-02 -2.794 0.00522 **
## Wea3 -8.513e-01 1.276e+00 -0.667 0.50473
## Wea4 1.887e-02 1.641e-01 0.115 0.90844
## Wea5 -2.008e-01 2.707e-01 -0.742 0.45831
## Wea6 -1.505e+00 3.050e-01 -4.935 8.29e-07 ***
## Wea8 -2.133e-01 9.056e-01 -0.236 0.81383
## Year -2.707e-02 1.760e-03 -15.374 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.802 on 5010 degrees of freedom
## (30619 observations deleted due to missingness)
## Multiple R-squared: 0.4464, Adjusted R-squared: 0.4448
## F-statistic: 288.5 on 14 and 5010 DF, p-value: < 2.2e-16
So this iteration seems the best model. We can’t remove wea because wea6 is very significant. Other variables are also significant. So we can say that our model is good. We can check residuals vs fitted values to check if there is pattern.
ggplot(Wavemodel6, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals")
Figure 15: Residuals vs Fitted values
In figure 15, we can say that there is pattern pattern in residuals vs fitted values. So we can not say that our model is strong. But we wanted to check Q-Q residuals plot to check if residuals are normally distributed.
ggplot(Wavemodel6, aes(sample = .resid)) +
stat_qq() +
stat_qq_line()
Figure 16: Q-Q residuals plot
In figure 16, we can see that residuals are normally distributed. Q-Q plot can be helpful in understanding how well the errors fit a normal distribution. The absence of a curved line could mean that the errors do not follow a normal distribution and that the model may make poor predictions.